Support Vector Machines (SVM)#

Introduction#

Let’s break down Support Vector Machines (SVMs) in simple terms. Imagine you have a bunch of dots on a piece of paper, and these dots belong to two different groups, let’s say red and blue. Now, you want to draw a line on the paper in such a way that it separates the red dots from the blue dots as much as possible.

Support Vector Machines do something similar in a more advanced way. Instead of just a line, SVMs find the best possible “boundary” (could be a line, plane, or more complex shape depending on the data) to separate different groups. The key is to find a boundary in such a way that there is as much space as possible between the dots of different groups.

Now, the “support vectors” in SVM are the dots that are closest to the boundary. Think of them as the most important dots for figuring out where the boundary should go. The SVM aims to maximize the distance between the support vectors and the boundary.

In simple terms, SVM helps classify things into different groups by finding the best way to draw a boundary between them, considering the support vectors and maximizing the space between groups. It’s like finding the optimal line or curve that best separates different types of data points on a graph.

Tip

In the context of our example, the boundary or line that separates the two groups is called the hyperplane. The space between the boundary and the nearest dots of each group is called the margin. Both of these terms are described below.

Hyperplanes#

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Margin#

The margin is the distance between the hyperplane and the closest data points from either class. The optimal hyperplane is the one that maximizes the margin between the two classes. The margin is important because it helps the model generalize better on unseen data.

Understanding the mathematics behind SVM#

Many people skip the mathematical intuition behind this algorithm because it is hard to digest. In this section, we'll try to understand each step of what happens under the hood.

Use of Dot Product in SVM#

Consider a point \(X\): we want to know whether it lies on the right side of the plane or the left side (positive or negative).


To find this out, we first treat the point as a vector \(x\) and then construct a vector \(w\) perpendicular to the hyperplane. Let \(b\) be the distance from the origin to the decision boundary measured along \(w\). Now we take the projection of the vector \(x\) onto \(w\).


We already know that the projection of one vector onto another is obtained via the dot product. Hence, we take the dot product of \(x\) and \(w\). If the dot product is greater than \(b\), the point lies on the right side. If the dot product is less than \(b\), the point lies on the left side, and if the dot product equals \(b\), the point lies on the decision boundary.

\(\overrightarrow{x} \cdot \overrightarrow{w} = b\) (the point lies on the decision boundary)
\(\overrightarrow{x} \cdot \overrightarrow{w} > b\) (positive samples)
\(\overrightarrow{x} \cdot \overrightarrow{w} < b\) (negative samples)
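The three cases above can be sketched in a few lines of code. The vector \(w\) and offset \(b\) below are illustrative values chosen by hand, not learned from data:

```python
import numpy as np

# Hypothetical hyperplane: w is a unit vector perpendicular to the
# boundary, b is the offset along w (illustrative values only).
w = np.array([1.0, 1.0]) / np.sqrt(2)
b = 1.0

def side(x, w, b):
    """Return +1, -1, or 0 depending on which side of the boundary x lies."""
    return np.sign(np.dot(x, w) - b)

print(side(np.array([2.0, 2.0]), w, b))  # dot product > b: positive side
print(side(np.array([0.0, 0.0]), w, b))  # dot product < b: negative side
```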

Equation of the Hyperplane, Margin#

This equation is derived for two-dimensional vectors, but in fact it works for any number of dimensions. Equation of the hyperplane:

\[H = \langle x, w \rangle - b\]

Distance from a data point to the hyperplane or margin:

\[M = \frac{1}{\Vert w \Vert}\]

Classification problem#

Now, based on the previous information, let's formulate the SVM classification problem.

In the SVM classification problem, where \(x \in \mathbb{R}^p\) and \(y \in \{-1, +1\}\), we aim to find parameters \(w \in \mathbb{R}^p\) and \(b \in \mathbb{R}\). The classification algorithm \(a(x, w)\) is defined as \(\text{sign}(\langle x, w \rangle - b)\).

The sign function is used to determine whether a point lies on one side or the other of the hyperplane. If \(a(x,w)>0\) the point is classified as belonging to class +1, and if \(a(x,w)<0\), the point is classified as belonging to class −1.

In SVM, the objective is to find the hyperplane that maximizes the margin.

Now let's look at a visualization of a hyperplane on randomly generated data.
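Such a plot can be produced by fitting a linear SVM to synthetic blobs and reading off \(w\) and \(b\). This is a minimal sketch assuming scikit-learn is available; the `make_blobs` parameters and the large `C` (which approximates a hard margin) are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points (illustrative parameters)
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.6)

clf = SVC(kernel="linear", C=1000)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset
print("w =", w, "b =", b)
print("number of support vectors:", clf.support_vectors_.shape[0])
```

The learned `w` and `b` define the decision boundary \(\langle w, x \rangle + b = 0\), which can then be drawn on top of the scatter plot.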

Maximizing margin#

Take a look at the plot below.

Let’s define \(H_0\), \(H_1\), \(H_2\) as planes:

\(H_0\): \(\langle w, x_i \rangle+b = 0\)

\(H_1\): \(\langle w, x_i \rangle+b = 1\)

\(H_2\): \(\langle w, x_i \rangle+b = -1\)

Recall that the distance from a point \((x_{1,0}, x_{2,0})\) to the line \(w_1x_1 + w_2x_2 + b = 0\) is:

\[\frac{|w_1x_{1,0} + w_2x_{2,0} + b|}{\sqrt{w_1^2 + w_2^2}}\]

The distance between \(H_0\) and \(H_1\) (the margin) is then:

\[\frac{\vert\langle w, x \rangle +b\vert}{\Vert w \Vert}=\frac 1{\Vert w \Vert}\]

The total distance between \(H_1\) and \(H_2\) is thus: \(\frac 2{\Vert w \Vert}\)

So, to maximize the margin, we should minimize \(\Vert w \Vert\).

Minimization of \(\Vert w \Vert\) is achieved through constrained optimization, which will not be covered in detail in this chapter.
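For completeness, the standard hard-margin formulation (commonly written with \(\frac{1}{2}\Vert w \Vert^2\) rather than \(\Vert w \Vert\), which gives the same minimizer but is easier to differentiate) is:

\[\min_{w,\, b} \ \frac{1}{2}\Vert w \Vert^2 \quad \text{subject to} \quad y_i\left(\langle w, x_i \rangle + b\right) \ge 1, \quad i = 1, \dots, n\]

The constraints require every point to lie on or outside its class's plane \(H_1\) or \(H_2\).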

(Figure: maximal-margin hyperplane with planes \(H_0\), \(H_1\), \(H_2\) on randomly generated data.)

Now let's take a look at how a linear SVM works on the Iris dataset.
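A minimal sketch of such an experiment, assuming scikit-learn: here we take the first two Iris classes (setosa and versicolor, which are linearly separable) and the first two features, an illustrative choice that keeps the data two-dimensional and easy to plot:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:100, :2]   # first two classes, sepal length and width
y = iris.target[:100]

clf = SVC(kernel="linear")
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```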

(Figure: linear SVM decision boundary on the Iris dataset.)

Kernels in Support Vector Machine#

The most interesting feature of SVM is that it can work even with a non-linear dataset, and for this we use the “Kernel Trick”, which makes it easier to classify the points. Suppose we have a dataset like this:


Here we see that we cannot draw a single line, or hyperplane, that classifies the points correctly. So what we do is convert this lower-dimensional space into a higher-dimensional space using some mapping functions (for example, quadratic ones), which allows us to find a decision boundary that clearly divides the data points. The functions that help us do this are called kernels, and which kernel to use is determined by hyperparameter tuning. Among the array of kernels, the polynomial kernel stands out as a versatile choice.


Polynomial kernel#

The polynomial kernel, expressed as \(K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^p\), introduces a tunable parameter \(p\), determining the degree of the polynomial.

Mathematically, the kernel function computes the dot product between the mapped data points in the higher-dimensional space without explicitly calculating the coordinates of those points.

Unlike the linear SVM, the polynomial kernel enables the SVM to operate in a more complex feature space, accommodating intricate relationships within the data.
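We can check this numerically for \(p = 2\): the kernel value \((\langle a, b \rangle + 1)^2\) equals the ordinary dot product after an explicit degree-2 feature map \(\varphi\). The particular \(\varphi\) below is one standard choice for 2-D inputs (the \(\sqrt{2}\) factors make the identity work out); the input vectors are arbitrary examples:

```python
import numpy as np

def poly_kernel(a, b, p=2):
    """Polynomial kernel K(a, b) = (<a, b> + 1)^p."""
    return (np.dot(a, b) + 1) ** p

def phi(x):
    """Explicit degree-2 feature map for 2-D input (one possible choice)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

print(poly_kernel(a, b))       # -> 25.0
print(np.dot(phi(a), phi(b)))  # same value, via the explicit 6-D mapping
```

The kernel evaluates one 2-D dot product; the explicit route builds 6-D vectors first, and the gap grows quickly with dimension and degree.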

Reminder

The addition of 1 to the dot product in the polynomial kernel is a technique known as the “bias trick” or “offset trick.” This addition ensures that even when the dot product of \(x_i\) and \(x_j\) is zero, the resulting value after adding 1 is non-zero.

Let's see a visualization that illustrates this feature.

Note

Evaluating \(K\) only requires one addition and one exponentiation more than the original dot product.

As we can see, with the help of the kernel, two-dimensional data was mapped into three-dimensional space, which made it possible to classify data that could not be classified with a conventional linear SVM.
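The same effect can be demonstrated in code. This sketch, assuming scikit-learn, compares a linear SVM with a polynomial-kernel SVM on concentric circles, a dataset no straight line can separate; the `make_circles` parameters and `degree=2, coef0=1` are illustrative choices matching the kernel \((\langle x_i, x_j \rangle + 1)^2\):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
poly = SVC(kernel="poly", degree=2, coef0=1).fit(X, y)

print("linear accuracy:", linear.score(X, y))  # around chance level
print("poly accuracy:  ", poly.score(X, y))    # close to 1.0
```

The degree-2 kernel implicitly adds quadratic features such as \(x_1^2 + x_2^2\), which is exactly what is needed to separate an inner circle from an outer ring.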